First of all, what is LOS? Hospital length-of-stay (LOS) is defined as the time between hospital admission and discharge, measured in days.
The goal is to create a model that predicts the length-of-stay for each patient at time of admission.
In order to predict hospital LOS, the relevant MIMIC tables first need to be extracted and prepared.
Since LOS is a continuous variable (measured in days) rather than a categorical one, a regression model will be used for prediction.
The expected outcome is that the model we use will be better at predicting hospital LOS than the industry standards of median and average LOS. The median LOS is simply the median LOS of past admissions to a hospital. Similarly, a second commonly used metric in healthcare is the average, or mean LOS.
So, to measure performance of our model, we'll compare the prediction model against the median and average LOS using the root-mean-square error (RMSE). The RMSE is a commonly used measure of the differences between values predicted by a model and the values observed, where a lower score implies better accuracy. For example, a perfect prediction model would have an RMSE of 0.
The RMSE for this work is defined as follows, where $n$ is the number of hospital admission records, $\hat{y}_i$ is the predicted LOS, and $y_i$ is the actual LOS:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(\hat{y}_i - y_i)^2}$$
We could say we have a successful model if its prediction results in a lower RMSE than the average or median models.
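As a concrete sketch of the baseline comparison (using made-up LOS values, not MIMIC data), this is how the median and mean baselines could be scored with RMSE:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical LOS values in days (illustrative only)
actual_los = np.array([2.0, 3.5, 6.0, 8.0, 30.0])

# Constant baselines: predict the median or the mean for every admission
median_pred = np.full_like(actual_los, np.median(actual_los))
mean_pred = np.full_like(actual_los, actual_los.mean())

rmse_median = np.sqrt(mean_squared_error(actual_los, median_pred))
rmse_mean = np.sqrt(mean_squared_error(actual_los, mean_pred))
print(rmse_median, rmse_mean)
```

A learned model is "successful" in the sense above if its RMSE comes in below both of these constants.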
There is a multitude of regression models available for predicting LOS. To choose among the subset of models that will be evaluated, the R2 (R-squared) score will be used.
R2 measures how much of the variability in the dependent variable can be explained by the model; in other words, it is the proportion of the variance in the dependent variable that is predictable from the independent variables. It is defined by the following equation, where $y_i$ is an observed data point, $\bar{y}$ is the mean of the observed data, and $f_i$ is the value predicted by the model:

$$R^2 = 1 - \frac{\sum_i (y_i - f_i)^2}{\sum_i (y_i - \bar{y})^2}$$

The best possible R2 score is 1.0; a negative value means the model is worse than a constant model (the average or median in this case).
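To make this behavior concrete, here is a small illustration (toy numbers, not MIMIC data) of how `sklearn`'s `r2_score` reacts to a perfect model, a constant-mean model, and a worse-than-constant model:

```python
from sklearn.metrics import r2_score

y_true = [3.0, 5.0, 7.0, 9.0]

# A perfect model scores 1.0
perfect = r2_score(y_true, [3.0, 5.0, 7.0, 9.0])

# A constant model that always predicts the mean of y_true scores 0.0
constant_mean = r2_score(y_true, [6.0, 6.0, 6.0, 6.0])

# Predictions worse than the constant-mean model score below 0
worse = r2_score(y_true, [9.0, 3.0, 9.0, 3.0])

print(perfect, constant_mean, worse)  # 1.0 0.0 -3.0
```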
# DataFrame inspection methods used throughout this notebook
dataframe.info()
dataframe.head()
dataframe.describe()
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error
from sklearn.preprocessing import MinMaxScaler
from sklearn.neighbors import KNeighborsRegressor
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
from scipy.stats import pearsonr
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import SGDRegressor
import statsmodels.api as sm
from sklearn.model_selection import GridSearchCV
mimic3_path = '../../mimic-iii-clinical-database-1.4/'
def dataframe_from_csv(path, header=0, index_col=0):
    return pd.read_csv(path, header=header, index_col=index_col)
# read admissions table
def read_admissions_table(mimic3_path):
    admits = dataframe_from_csv(mimic3_path + 'ADMISSIONS.csv')
    # Pre-emptively don't include some columns that I don't need
    admits = admits[['SUBJECT_ID', 'HADM_ID', 'ADMITTIME', 'DISCHTIME', 'DEATHTIME', 'ADMISSION_TYPE', 'INSURANCE', 'ETHNICITY', 'DIAGNOSIS']]
    admits.ADMITTIME = pd.to_datetime(admits.ADMITTIME)
    admits.DISCHTIME = pd.to_datetime(admits.DISCHTIME)
    admits.DEATHTIME = pd.to_datetime(admits.DEATHTIME)
    return admits
admits = read_admissions_table(mimic3_path)
admits.head()
| SUBJECT_ID | HADM_ID | ADMITTIME | DISCHTIME | DEATHTIME | ADMISSION_TYPE | INSURANCE | ETHNICITY | DIAGNOSIS | |
|---|---|---|---|---|---|---|---|---|---|
| ROW_ID | |||||||||
| 21 | 22 | 165315 | 2196-04-09 12:26:00 | 2196-04-10 15:54:00 | NaT | EMERGENCY | Private | WHITE | BENZODIAZEPINE OVERDOSE |
| 22 | 23 | 152223 | 2153-09-03 07:15:00 | 2153-09-08 19:10:00 | NaT | ELECTIVE | Medicare | WHITE | CORONARY ARTERY DISEASE\CORONARY ARTERY BYPASS... |
| 23 | 23 | 124321 | 2157-10-18 19:34:00 | 2157-10-25 14:00:00 | NaT | EMERGENCY | Medicare | WHITE | BRAIN MASS |
| 24 | 24 | 161859 | 2139-06-06 16:14:00 | 2139-06-09 12:48:00 | NaT | EMERGENCY | Private | WHITE | INTERIOR MYOCARDIAL INFARCTION |
| 25 | 25 | 129635 | 2160-11-02 02:06:00 | 2160-11-05 14:55:00 | NaT | EMERGENCY | Private | WHITE | ACUTE CORONARY SYNDROME |
The LOS is not explicitly stored as an attribute in the ADMISSIONS table, so we have to calculate it. As stated above, LOS is defined as the time between admission to and discharge from the hospital.
# Create LOS attribute converting timedelta type into float 'days', 86400 seconds in a day
admits['LOS'] = (admits['DISCHTIME'] - admits['ADMITTIME']).dt.total_seconds()/86400
# Verify LOS computation
admits[['ADMITTIME', 'DISCHTIME', 'LOS']].head()
| ADMITTIME | DISCHTIME | LOS | |
|---|---|---|---|
| ROW_ID | |||
| 21 | 2196-04-09 12:26:00 | 2196-04-10 15:54:00 | 1.144444 |
| 22 | 2153-09-03 07:15:00 | 2153-09-08 19:10:00 | 5.496528 |
| 23 | 2157-10-18 19:34:00 | 2157-10-25 14:00:00 | 6.768056 |
| 24 | 2139-06-06 16:14:00 | 2139-06-09 12:48:00 | 2.856944 |
| 25 | 2160-11-02 02:06:00 | 2160-11-05 14:55:00 | 3.534028 |
# We could already have a quick insight on how LOS values are distributed
admits['LOS'].describe()
count    58976.000000
mean        10.133916
std         12.456682
min         -0.945139
25%          3.743750
50%          6.467014
75%         11.795139
max        294.660417
Name: LOS, dtype: float64
We notice that the mean LOS is about 10 days, but also that the minimum calculated LOS is negative. How can a LOS be negative? Let's look at the records associated with negative LOS values:
admits[admits['LOS'] < 0]
| SUBJECT_ID | HADM_ID | ADMITTIME | DISCHTIME | DEATHTIME | ADMISSION_TYPE | INSURANCE | ETHNICITY | DIAGNOSIS | LOS | |
|---|---|---|---|---|---|---|---|---|---|---|
| ROW_ID | ||||||||||
| 534 | 417 | 102633 | 2177-03-23 16:17:00 | 2177-03-23 07:20:00 | 2177-03-23 07:20:00 | URGENT | Private | WHITE | ORGAN DONOR ACCOUNT | -0.372917 |
| 237 | 181 | 102631 | 2153-10-12 09:49:00 | 2153-10-12 06:29:00 | 2153-10-12 06:29:00 | EMERGENCY | Private | WHITE | DISSECTING ANEURYSIM | -0.138889 |
| 644 | 516 | 187482 | 2197-07-31 20:18:00 | 2197-07-31 01:10:00 | 2197-07-31 01:10:00 | EMERGENCY | Medicare | UNKNOWN/NOT SPECIFIED | RESPIRATORY DISTRESS | -0.797222 |
| 1640 | 1334 | 138015 | 2137-09-02 14:43:00 | 2137-09-02 12:00:00 | 2137-09-02 12:00:00 | NEWBORN | Private | WHITE | NEWBORN | -0.113194 |
| 1699 | 1381 | 181430 | 2189-01-02 14:25:00 | 2189-01-02 12:00:00 | 2189-01-02 12:00:00 | EMERGENCY | Medicare | WHITE | STROKE;TELEMETRY | -0.100694 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 57980 | 96810 | 165589 | 2103-09-25 20:38:00 | 2103-09-25 12:00:00 | 2103-09-25 12:00:00 | EMERGENCY | Medicaid | WHITE | CARDIAC ARREST | -0.359722 |
| 58635 | 98943 | 193747 | 2164-11-14 20:03:00 | 2164-11-14 12:00:00 | 2164-11-14 12:00:00 | EMERGENCY | Medicare | WHITE | INTRACRANIAL HEMORRHAGE | -0.335417 |
| 58720 | 99207 | 191004 | 2143-07-06 19:59:00 | 2143-07-06 12:00:00 | 2143-07-06 12:00:00 | EMERGENCY | Medicaid | WHITE | GSW L. CHEST WALL & ABDOMEN | -0.332639 |
| 55950 | 90642 | 101946 | 2122-04-24 14:36:00 | 2122-04-24 12:00:00 | 2122-04-24 12:00:00 | EMERGENCY | Private | UNABLE TO OBTAIN | ST-SEGMENT ELEVATION MYOCARDIAL INFARCTION\CATH | -0.108333 |
| 57477 | 95367 | 139266 | 2135-04-03 14:16:00 | 2135-04-03 12:00:00 | 2135-04-03 12:00:00 | EMERGENCY | Medicare | WHITE | CHEST PAIN | -0.094444 |
98 rows × 10 columns
# We notice that rows with a negative LOS are usually associated with a time of death before admission,
# so there is no point in predicting LOS for these admissions and we drop them
admits = admits[admits['LOS'] > 0]
admits.describe()
| SUBJECT_ID | HADM_ID | LOS | |
|---|---|---|---|
| count | 58878.000000 | 58878.000000 | 58878.000000 |
| mean | 33761.791382 | 149966.149886 | 10.151266 |
| std | 28092.613275 | 28882.995648 | 12.459774 |
| min | 2.000000 | 100001.000000 | 0.001389 |
| 25% | 11999.250000 | 124942.750000 | 3.755556 |
| 50% | 24141.000000 | 149987.000000 | 6.489583 |
| 75% | 53862.750000 | 174958.000000 | 11.805556 |
| max | 99999.000000 | 199999.000000 | 294.660417 |
Now the minimum LOS is no longer negative. For a more informative view of the distribution of LOS values, we plot them:
# Plot LOS Distribution
plt.hist(admits['LOS'], bins=200, color = '#11a612')
plt.xlim(0, 50)
plt.title('Distribution of LOS for all hospital admissions \n incl. deceased')
plt.ylabel('Count')
plt.xlabel('Length-of-Stay (days)')
plt.tick_params(top=False, right=False)
plt.show();
# We also drop some other attributes from ADMISSIONS table that are not useful anymore for our prediction
admits.drop(columns='DISCHTIME', inplace=True)
admits.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 58878 entries, 21 to 58598
Data columns (total 9 columns):
 #   Column          Non-Null Count  Dtype
---  ------          --------------  -----
 0   SUBJECT_ID      58878 non-null  int64
 1   HADM_ID         58878 non-null  int64
 2   ADMITTIME       58878 non-null  datetime64[ns]
 3   DEATHTIME       5774 non-null   datetime64[ns]
 4   ADMISSION_TYPE  58878 non-null  object
 5   INSURANCE       58878 non-null  object
 6   ETHNICITY       58878 non-null  object
 7   DIAGNOSIS       58853 non-null  object
 8   LOS             58878 non-null  float64
dtypes: datetime64[ns](2), float64(1), int64(2), object(4)
memory usage: 4.5+ MB
Another thing to consider is admissions of patients who died at the hospital. These admissions will be excluded, as the typically shorter stays of this group would bias the LOS.
# When DEATHTIME in ADMISSIONS is not null, the associated patient died at the hospital; we mark this distinction with a boolean variable
admits['DIED_AT_HOSPITAL'] = admits['DEATHTIME'].notnull().map({True:1, False:0})
print("{} of {} patients died at the hospital".format(admits['DIED_AT_HOSPITAL'].sum(), admits['SUBJECT_ID'].nunique()))
5774 of 46445 patients died at the hospital
We also said that we'll use the LOS mean and median for comparison, to gauge the accuracy of our model. So let's compute these LOS metrics, which we'll use later for model evaluation.
# Hospital LOS metrics for later comparison
actual_mean_los = admits['LOS'].loc[admits['DIED_AT_HOSPITAL'] == 0].mean()
actual_median_los = admits['LOS'].loc[admits['DIED_AT_HOSPITAL'] == 0].median()
print(actual_mean_los)
print(actual_median_los)
10.138173704219758 6.565972222222222
# ETHNICITY
admits['ETHNICITY'].value_counts()
WHITE                                                       40939
BLACK/AFRICAN AMERICAN                                       5434
UNKNOWN/NOT SPECIFIED                                        4502
HISPANIC OR LATINO                                           1693
ASIAN                                                        1508
OTHER                                                        1507
UNABLE TO OBTAIN                                              809
PATIENT DECLINED TO ANSWER                                    559
ASIAN - CHINESE                                               277
HISPANIC/LATINO - PUERTO RICAN                                232
BLACK/CAPE VERDEAN                                            200
WHITE - RUSSIAN                                               164
MULTI RACE ETHNICITY                                          130
BLACK/HAITIAN                                                 101
ASIAN - ASIAN INDIAN                                           85
WHITE - OTHER EUROPEAN                                         81
HISPANIC/LATINO - DOMINICAN                                    78
PORTUGUESE                                                     61
WHITE - BRAZILIAN                                              59
ASIAN - VIETNAMESE                                             53
AMERICAN INDIAN/ALASKA NATIVE                                  51
BLACK/AFRICAN                                                  44
MIDDLE EASTERN                                                 43
HISPANIC/LATINO - GUATEMALAN                                   40
WHITE - EASTERN EUROPEAN                                       25
ASIAN - FILIPINO                                               25
HISPANIC/LATINO - CUBAN                                        24
HISPANIC/LATINO - SALVADORAN                                   19
NATIVE HAWAIIAN OR OTHER PACIFIC ISLANDER                      18
ASIAN - OTHER                                                  17
ASIAN - CAMBODIAN                                              17
HISPANIC/LATINO - MEXICAN                                      13
HISPANIC/LATINO - CENTRAL AMERICAN (OTHER)                     13
ASIAN - KOREAN                                                 13
HISPANIC/LATINO - COLOMBIAN                                     9
CARIBBEAN ISLAND                                                9
SOUTH AMERICAN                                                  8
ASIAN - JAPANESE                                                7
ASIAN - THAI                                                    4
HISPANIC/LATINO - HONDURAN                                      4
AMERICAN INDIAN/ALASKA NATIVE FEDERALLY RECOGNIZED TRIBE        3
Name: ETHNICITY, dtype: int64
We notice that there are many ETHNICITY categories, but most of them are subcategories, so we can reduce their number by keeping only the main category.
# Compress the number of ethnicity categories
admits['ETHNICITY'].replace(regex=r'^ASIAN\D*', value='ASIAN', inplace=True)
admits['ETHNICITY'].replace(regex=r'^WHITE\D*', value='WHITE', inplace=True)
admits['ETHNICITY'].replace(regex=r'^HISPANIC\D*', value='HISPANIC/LATINO', inplace=True)
admits['ETHNICITY'].replace(regex=r'^BLACK\D*', value='BLACK/AFRICAN AMERICAN', inplace=True)
admits['ETHNICITY'].replace(['UNABLE TO OBTAIN', 'OTHER', 'PATIENT DECLINED TO ANSWER',
'UNKNOWN/NOT SPECIFIED'], value='OTHER/UNKNOWN', inplace=True)
# Keep only the top-5 categories by value count; the others fall into the OTHER/UNKNOWN category
# (assignment via .loc on the dataframe avoids a SettingWithCopyWarning)
top5 = admits['ETHNICITY'].value_counts().nlargest(5).index.tolist()
admits.loc[~admits['ETHNICITY'].isin(top5), 'ETHNICITY'] = 'OTHER/UNKNOWN'
admits['ETHNICITY'].value_counts()
WHITE                     41268
OTHER/UNKNOWN              7700
BLACK/AFRICAN AMERICAN     5779
HISPANIC/LATINO            2125
ASIAN                      2006
Name: ETHNICITY, dtype: int64
# Re-usable plotting function
def plot_los_groupby(variable, size=(7,4)):
    '''
    Plot median LOS grouped by a categorical column of the admissions dataframe
    '''
    results = admits[[variable, 'LOS']].groupby(variable).median().reset_index()
    values = list(results['LOS'].values)
    labels = list(results[variable].values)
    fig, ax = plt.subplots(figsize=size)
    ind = range(len(results))
    ax.barh(ind, values, align='center', height=0.6, color='#55a868', alpha=0.8)
    ax.set_yticks(ind)
    ax.set_yticklabels(labels)
    ax.set_xlabel('Median Length of Stay (days)')
    ax.tick_params(left=False, top=False, right=False)
    ax.set_title('Comparison of {} labels'.format(variable))
    plt.tight_layout()
    plt.show()
# Look at median LOS for groups ETHNICITY
plot_los_groupby('ETHNICITY', size=(10,4))
It's interesting to note that ASIAN patients have the lowest median LOS, even though they are fewer in number than the other ETHNICITY categories.
Now let's repeat the analysis done for ETHNICITY on ADMISSION_TYPE and INSURANCE, reducing the number of categories where necessary.
# ADMISSION_TYPE
admits['ADMISSION_TYPE'].value_counts()
EMERGENCY    41989
NEWBORN       7854
ELECTIVE      7702
URGENT        1333
Name: ADMISSION_TYPE, dtype: int64
The number of categories in ADMISSION_TYPE is not that high, but the URGENT category contains far fewer entries than the others. In addition, it is semantically very close to EMERGENCY, so we can merge the two and treat URGENT as EMERGENCY.
admits['ADMISSION_TYPE'].replace(to_replace='URGENT', value='EMERGENCY', inplace=True)
admits['ADMISSION_TYPE'].value_counts()
EMERGENCY    43322
NEWBORN       7854
ELECTIVE      7702
Name: ADMISSION_TYPE, dtype: int64
# Look at median LOS for groups ADMISSION_TYPE
plot_los_groupby('ADMISSION_TYPE', size=(10,4))
As we might expect, newborns have the lowest median LOS, followed by elective admissions. This makes sense, since elective admissions are usually planned in advance with their risks understood, in contrast to the EMERGENCY admission type.
Finally, let's do the same for the INSURANCE feature.
admits['INSURANCE'].value_counts()
Medicare      28174
Private       22542
Medicaid       5778
Government     1781
Self Pay        603
Name: INSURANCE, dtype: int64
Analyzing this attribute, we see that the categories are already quite distinct. We do notice, however, that 'Self Pay' typically means the patient cannot (or did not) pay; in fact, Self Pay patients have the lowest LOS, plausibly because they cannot afford a longer hospital stay.
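The same `plot_los_groupby` helper defined earlier can be reused on INSURANCE for the plot. As a checkable sketch of the underlying group-median computation, here it is on illustrative toy values (not MIMIC output):

```python
import pandas as pd

# Toy LOS values per insurance type (illustrative only, not MIMIC data)
toy = pd.DataFrame({
    'INSURANCE': ['Medicare', 'Private', 'Self Pay', 'Medicare', 'Self Pay'],
    'LOS': [8.0, 6.5, 2.0, 10.0, 3.0],
})

# Median LOS per group, sorted ascending: Self Pay comes out lowest here
median_by_ins = toy.groupby('INSURANCE')['LOS'].median().sort_values()
print(median_by_ins)
```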
# read patients table
def read_patients_table(mimic3_path):
    pats = dataframe_from_csv(mimic3_path + 'PATIENTS.csv')
    # Pre-emptively don't include some columns that I don't need
    pats = pats[['SUBJECT_ID', 'GENDER', 'DOB', 'DOD']]
    pats.DOB = pd.to_datetime(pats.DOB)
    pats.DOD = pd.to_datetime(pats.DOD)
    return pats
patients = read_patients_table(mimic3_path)
patients.head()
| SUBJECT_ID | GENDER | DOB | DOD | |
|---|---|---|---|---|
| ROW_ID | ||||
| 234 | 249 | F | 2075-03-13 | NaT |
| 235 | 250 | F | 2164-12-27 | 2188-11-22 |
| 236 | 251 | M | 2090-03-15 | NaT |
| 237 | 252 | M | 2078-03-06 | NaT |
| 238 | 253 | F | 2089-11-26 | NaT |
The PATIENTS table contains DOB (date of birth) but not the patient's age, so to look at the age distribution we have to compute an AGE attribute. We compute it as the difference between a patient's date of birth (DOB) and the date of their first admission, ignoring subsequent admissions in this calculation.
To do this, let's first merge the PATIENTS table with the ADMISSIONS table explored previously.
# drop DOD (Date of Death) because this information is already present in the ADMISSIONS table
patients = patients[['SUBJECT_ID', 'GENDER', 'DOB']]
# merge the PATIENTS table with ADMISSIONS table
admits_patients = pd.merge(admits, patients, how='inner', on='SUBJECT_ID')
admits_patients.head()
| SUBJECT_ID | HADM_ID | ADMITTIME | DEATHTIME | ADMISSION_TYPE | INSURANCE | ETHNICITY | DIAGNOSIS | LOS | DIED_AT_HOSPITAL | GENDER | DOB | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 22 | 165315 | 2196-04-09 12:26:00 | NaT | EMERGENCY | Private | WHITE | BENZODIAZEPINE OVERDOSE | 1.144444 | 0 | F | 2131-05-07 |
| 1 | 23 | 152223 | 2153-09-03 07:15:00 | NaT | ELECTIVE | Medicare | WHITE | CORONARY ARTERY DISEASE\CORONARY ARTERY BYPASS... | 5.496528 | 0 | M | 2082-07-17 |
| 2 | 23 | 124321 | 2157-10-18 19:34:00 | NaT | EMERGENCY | Medicare | WHITE | BRAIN MASS | 6.768056 | 0 | M | 2082-07-17 |
| 3 | 24 | 161859 | 2139-06-06 16:14:00 | NaT | EMERGENCY | Private | WHITE | INTERIOR MYOCARDIAL INFARCTION | 2.856944 | 0 | M | 2100-05-31 |
| 4 | 25 | 129635 | 2160-11-02 02:06:00 | NaT | EMERGENCY | Private | WHITE | ACUTE CORONARY SYNDROME | 3.534028 | 0 | M | 2101-11-21 |
# Find the first admission date for each patient
admits_patients_age_min = admits_patients[['SUBJECT_ID', 'ADMITTIME']].groupby('SUBJECT_ID').min().reset_index()
admits_patients_age_min.columns = ['SUBJECT_ID', 'ADMIT_MIN']
admits_patients_age_min.head()
| SUBJECT_ID | ADMIT_MIN | |
|---|---|---|
| 0 | 2 | 2138-07-17 19:04:00 |
| 1 | 3 | 2101-10-20 19:08:00 |
| 2 | 4 | 2191-03-16 00:28:00 |
| 3 | 5 | 2103-02-02 04:31:00 |
| 4 | 6 | 2175-05-30 07:15:00 |
# merge with the created table containing the first admission for each patient
admits_patients = admits_patients.merge(admits_patients_age_min, how='outer', on='SUBJECT_ID')
admits_patients.head()
| SUBJECT_ID | HADM_ID | ADMITTIME | DEATHTIME | ADMISSION_TYPE | INSURANCE | ETHNICITY | DIAGNOSIS | LOS | DIED_AT_HOSPITAL | GENDER | DOB | ADMIT_MIN | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 22 | 165315 | 2196-04-09 12:26:00 | NaT | EMERGENCY | Private | WHITE | BENZODIAZEPINE OVERDOSE | 1.144444 | 0 | F | 2131-05-07 | 2196-04-09 12:26:00 |
| 1 | 23 | 152223 | 2153-09-03 07:15:00 | NaT | ELECTIVE | Medicare | WHITE | CORONARY ARTERY DISEASE\CORONARY ARTERY BYPASS... | 5.496528 | 0 | M | 2082-07-17 | 2153-09-03 07:15:00 |
| 2 | 23 | 124321 | 2157-10-18 19:34:00 | NaT | EMERGENCY | Medicare | WHITE | BRAIN MASS | 6.768056 | 0 | M | 2082-07-17 | 2153-09-03 07:15:00 |
| 3 | 24 | 161859 | 2139-06-06 16:14:00 | NaT | EMERGENCY | Private | WHITE | INTERIOR MYOCARDIAL INFARCTION | 2.856944 | 0 | M | 2100-05-31 | 2139-06-06 16:14:00 |
| 4 | 25 | 129635 | 2160-11-02 02:06:00 | NaT | EMERGENCY | Private | WHITE | ACUTE CORONARY SYNDROME | 3.534028 | 0 | M | 2101-11-21 | 2160-11-02 02:06:00 |
# Now we can compute the age by finding the difference between first admission date and date of birth
admits_patients['AGE'] = (admits_patients['ADMIT_MIN'].subtract(admits_patients['DOB'])).dt.days // 365
# DOB has been shifted for patients older than 89 to obscure their age and comply with HIPAA, the median age for the patients whose date of birth was shifted is 91.4
admits_patients['AGE'] = np.where(admits_patients['AGE'] < 0, 91.4, admits_patients['AGE'])
admits_patients['AGE'].isnull().sum()
0
#let's see the distribution of age
plt.hist(admits_patients['AGE'], bins=20, color='#a12a52')
plt.ylabel('Count')
plt.xlabel('Age (years)')
plt.title('Distribution of Age in MIMIC-III')
plt.tick_params(left=False, bottom=False, top=False, right=False)
plt.show()
As we can see from the age distribution, patients in childhood (beyond the newborn group) are absent; this reflects the fact that MIMIC-III does not contain data from pediatric patients.
Now let's see how LOS, our prediction target, correlates with the age of the patients.
plt.scatter(admits_patients['AGE'], admits_patients['LOS'], alpha=0.005)
plt.ylabel('LOS (days)')
plt.xlabel('Age (years)')
plt.title('Age versus Length-of-stay')
plt.ylim(1, 50)
plt.show()
The plot highlights the MIMIC groups of newborns and >89-year-olds, and the number of admissions increases from age 20 toward 80. Because of the discrete-like distribution at the extremes of age, it can be useful to bin ages into the categories newborn, young adult, middle adult, and senior for use in the prediction model.
age_ranges = [(0, 13), (13, 36), (36, 56), (56, 100)]
for num, cat_range in enumerate(age_ranges):
    admits_patients['AGE'] = np.where(admits_patients['AGE'].between(cat_range[0], cat_range[1]), num, admits_patients['AGE'])
age_dict = {0: 'NEWBORN', 1: 'YOUNG_ADULT', 2: 'MIDDLE_ADULT', 3: 'SENIOR'}
admits_patients['AGE'] = admits_patients['AGE'].replace(age_dict)
admits_patients.AGE.value_counts()
SENIOR          33785
MIDDLE_ADULT    12732
NEWBORN          8101
YOUNG_ADULT      4260
Name: AGE, dtype: int64
Finally, let's look at the distribution of gender among patients in relation to LOS, before moving on to the next table.
# Re-usable boxplot function
def boxplot_los_groupby(variable, los_range=(-1, 30), size=(8,4)):
    '''
    Boxplot of LOS grouped by a categorical column of the dataframe
    '''
    results = admits_patients[[variable, 'LOS']].groupby(variable).median().reset_index()
    categories = results[variable].values.tolist()
    hist_data = []
    for cat in categories:
        hist_data.append(admits_patients['LOS'].loc[admits_patients[variable]==cat].values)
    fig, ax = plt.subplots(figsize=size)
    ax.boxplot(hist_data, 0, '', vert=False)
    ax.set_xlim(los_range)
    ax.set_yticklabels(categories)
    ax.set_xlabel('Length of Stay (days)')
    ax.tick_params(left=False, right=False)
    ax.set_title('Comparison of {} categories'.format(variable))
    plt.tight_layout()
    plt.show()
boxplot_los_groupby('GENDER', los_range=(0, 30))
admits_patients['GENDER'].replace({'M': 0, 'F':1}, inplace=True)
# read diagnoses_icd table
def read_diagnoses_icd_table(mimic3_path):
    diag_icds = dataframe_from_csv(mimic3_path + 'DIAGNOSES_ICD.csv')
    return diag_icds
diag_icds = read_diagnoses_icd_table(mimic3_path)
diag_icds.head()
| SUBJECT_ID | HADM_ID | SEQ_NUM | ICD9_CODE | |
|---|---|---|---|---|
| ROW_ID | ||||
| 1297 | 109 | 172335 | 1.0 | 40301 |
| 1298 | 109 | 172335 | 2.0 | 486 |
| 1299 | 109 | 172335 | 3.0 | 58281 |
| 1300 | 109 | 172335 | 4.0 | 5855 |
| 1301 | 109 | 172335 | 5.0 | 4254 |
The International Classification of Diseases, Clinical Modification (ICD-9-CM) is an adaptation created by the U.S. National Center for Health Statistics (NCHS), used to assign diagnostic and procedure codes associated with inpatient, outpatient, and physician office utilization in the United States (from Wikipedia: https://en.wikipedia.org/wiki/International_Classification_of_Diseases#ICD-9).
print('There are {} unique ICD9 codes in this dataset.'.format(diag_icds['ICD9_CODE'].value_counts().count()))
There are 6984 unique ICD9 codes in this dataset.
Because it's not feasible to use 6984 unique values as features for predicting LOS, it is necessary to reduce the diagnoses to more general categories. After researching the ICD-9 coding scheme (from Wikipedia: https://en.wikipedia.org/wiki/List_of_ICD-9_codes), we notice that the codes are arranged into super categories like the following:

As we can see, our attention can focus on just the first 3 numeric digits of each code. Our task now is to recode each ICD-9 code to its super category.
# Filter out E and V codes since processing will be done on the numeric first 3 values
diag_icds['RECODE'] = diag_icds['ICD9_CODE']
diag_icds['RECODE'] = diag_icds['RECODE'][~diag_icds['RECODE'].str.contains("[a-zA-Z]").fillna(False)]
diag_icds['RECODE'].fillna(value='999', inplace=True)
# Take in consideration just the first 3 integers of the ICD9 code
diag_icds['RECODE'] = diag_icds['RECODE'].str.slice(start=0, stop=3, step=1)
diag_icds['RECODE'] = diag_icds['RECODE'].astype(int)
diag_icds.head()
| SUBJECT_ID | HADM_ID | SEQ_NUM | ICD9_CODE | RECODE | |
|---|---|---|---|---|---|
| ROW_ID | |||||
| 1297 | 109 | 172335 | 1.0 | 40301 | 403 |
| 1298 | 109 | 172335 | 2.0 | 486 | 486 |
| 1299 | 109 | 172335 | 3.0 | 58281 | 582 |
| 1300 | 109 | 172335 | 4.0 | 5855 | 585 |
| 1301 | 109 | 172335 | 5.0 | 4254 | 425 |
# ICD-9 Main Category ranges
icd9_ranges = [(1, 140), (140, 240), (240, 280), (280, 290), (290, 320), (320, 390),
(390, 460), (460, 520), (520, 580), (580, 630), (630, 680), (680, 710),
(710, 740), (740, 760), (760, 780), (780, 800), (800, 1000), (1000, 2000)]
# Associated category names
diag_dict = {0: 'infectious', 1: 'neoplasms', 2: 'endocrine', 3: 'blood',
4: 'mental', 5: 'nervous', 6: 'circulatory', 7: 'respiratory',
8: 'digestive', 9: 'genitourinary', 10: 'pregnancy', 11: 'skin',
12: 'muscular', 13: 'congenital', 14: 'prenatal', 15: 'misc',
16: 'injury', 17: 'misc'}
# Re-code in terms of integer
for num, cat_range in enumerate(icd9_ranges):
    diag_icds['RECODE'] = np.where(diag_icds['RECODE'].between(cat_range[0], cat_range[1]), num, diag_icds['RECODE'])
# Convert integer to category name using diag_dict
diag_icds['SUPER_CATEGORY'] = diag_icds['RECODE'].replace(diag_dict)
diag_icds.head()
| SUBJECT_ID | HADM_ID | SEQ_NUM | ICD9_CODE | RECODE | SUPER_CATEGORY | |
|---|---|---|---|---|---|---|
| ROW_ID | ||||||
| 1297 | 109 | 172335 | 1.0 | 40301 | 6 | circulatory |
| 1298 | 109 | 172335 | 2.0 | 486 | 7 | respiratory |
| 1299 | 109 | 172335 | 3.0 | 58281 | 9 | genitourinary |
| 1300 | 109 | 172335 | 4.0 | 5855 | 9 | genitourinary |
| 1301 | 109 | 172335 | 5.0 | 4254 | 6 | circulatory |
For each admission there is usually more than one diagnosis, and often more than one diagnosis within the same category.
We can create a matrix that counts the diagnoses in each category for each admission. This should be keyed on HADM_ID rather than SUBJECT_ID, since each patient can have different diagnoses on each admission.
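Before running it on the full table, it may help to see this list-to-count-matrix step on toy data (hypothetical HADM_IDs and category lists):

```python
import pandas as pd

# Two toy admissions with their diagnosis-category lists (illustrative data)
hadm_toy = pd.DataFrame({
    'HADM_ID': [100001, 100002],
    'SUPER_CATEGORY': [['circulatory', 'circulatory', 'respiratory'],
                       ['digestive']],
})

# Expand each list into columns, stack into one labelled series, one-hot
# encode, then sum back per admission (index level 0) to get category counts
matrix = (pd.get_dummies(hadm_toy['SUPER_CATEGORY'].apply(pd.Series).stack())
            .groupby(level=0).sum())
print(matrix)
```

Row 0 ends up with a count of 2 for `circulatory` and 1 for `respiratory`, which is exactly the shape of matrix the prediction model needs.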
# Create list of diagnoses for each admission
hadm_list = diag_icds.groupby('HADM_ID')['SUPER_CATEGORY'].apply(list).reset_index()
hadm_list.head()
| HADM_ID | SUPER_CATEGORY | |
|---|---|---|
| 0 | 100001 | [endocrine, nervous, genitourinary, digestive,... |
| 1 | 100003 | [digestive, blood, infectious, digestive, circ... |
| 2 | 100006 | [respiratory, respiratory, respiratory, neopla... |
| 3 | 100007 | [digestive, digestive, injury, respiratory, ci... |
| 4 | 100009 | [circulatory, injury, circulatory, endocrine, ... |
# Convert diagnoses list into hospital admission-item matrix
hadm_item = pd.get_dummies(hadm_list['SUPER_CATEGORY'].apply(pd.Series).stack()).groupby(level=0).sum()  # Series.sum(level=0) is deprecated; group by the index level instead
hadm_item.head()
| blood | circulatory | congenital | digestive | endocrine | genitourinary | infectious | injury | mental | misc | muscular | neoplasms | nervous | pregnancy | prenatal | respiratory | skin | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 2 | 0 | 2 | 5 | 2 | 0 | 2 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 1 |
| 1 | 1 | 2 | 0 | 4 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 2 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 3 | 0 |
| 3 | 0 | 1 | 0 | 2 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 4 | 1 | 7 | 0 | 0 | 3 | 0 | 0 | 7 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
# Join back with HADM_ID
hadm_item = hadm_item.join(hadm_list['HADM_ID'], how="outer")
# Merge with main admissions dataframe
admits_patients_diag = pd.merge(admits_patients, hadm_item, how='inner', on='HADM_ID')
admits_patients_diag.head()
| SUBJECT_ID | HADM_ID | ADMITTIME | DEATHTIME | ADMISSION_TYPE | INSURANCE | ETHNICITY | DIAGNOSIS | LOS | DIED_AT_HOSPITAL | ... | injury | mental | misc | muscular | neoplasms | nervous | pregnancy | prenatal | respiratory | skin | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 22 | 165315 | 2196-04-09 12:26:00 | NaT | EMERGENCY | Private | WHITE | BENZODIAZEPINE OVERDOSE | 1.144444 | 0 | ... | 4 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 1 | 23 | 152223 | 2153-09-03 07:15:00 | NaT | ELECTIVE | Medicare | WHITE | CORONARY ARTERY DISEASE\CORONARY ARTERY BYPASS... | 5.496528 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 2 | 23 | 124321 | 2157-10-18 19:34:00 | NaT | EMERGENCY | Medicare | WHITE | BRAIN MASS | 6.768056 | 0 | ... | 3 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 0 |
| 3 | 24 | 161859 | 2139-06-06 16:14:00 | NaT | EMERGENCY | Private | WHITE | INTERIOR MYOCARDIAL INFARCTION | 2.856944 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 25 | 129635 | 2160-11-02 02:06:00 | NaT | EMERGENCY | Private | WHITE | ACUTE CORONARY SYNDROME | 3.534028 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
5 rows × 31 columns
# Look at the median LOS by diagnosis category
diag_cat_list = ['skin', 'infectious', 'misc', 'genitourinary', 'neoplasms', 'blood', 'respiratory',
'congenital','nervous', 'muscular', 'digestive', 'mental', 'endocrine', 'injury',
'circulatory', 'prenatal', 'pregnancy']
results = []
for variable in diag_cat_list:
    results.append(admits_patients_diag[[variable, 'LOS']].groupby(variable).median().reset_index().values[1][1])
sns.set(style="whitegrid")
fig, ax = plt.subplots(figsize=(7,5))
ind = range(len(results))
ax.barh(ind, results, align='edge', color = '#55a868', alpha=0.8)
ax.set_yticks(ind)
ax.set_yticklabels(diag_cat_list)
ax.set_xlabel('Median Length of Stay (days)')
ax.tick_params(left=False, right=False, top=False)
ax.set_title('Comparison of Diagnoses')
plt.show();
Looking at the median LOS for each ICD-9 supercategory shows an impressive spread between pregnancy and skin diagnosis code groups.
# read icustays table
def read_icustays_table(mimic3_path):
    icustays = dataframe_from_csv(mimic3_path + 'ICUSTAYS.csv')
    return icustays
icustays = read_icustays_table(mimic3_path)
icustays.head()
| SUBJECT_ID | HADM_ID | ICUSTAY_ID | DBSOURCE | FIRST_CAREUNIT | LAST_CAREUNIT | FIRST_WARDID | LAST_WARDID | INTIME | OUTTIME | LOS | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| ROW_ID | |||||||||||
| 365 | 268 | 110404 | 280836 | carevue | MICU | MICU | 52 | 52 | 2198-02-14 23:27:38 | 2198-02-18 05:26:11 | 3.2490 |
| 366 | 269 | 106296 | 206613 | carevue | MICU | MICU | 52 | 52 | 2170-11-05 11:05:29 | 2170-11-08 17:46:57 | 3.2788 |
| 367 | 270 | 188028 | 220345 | carevue | CCU | CCU | 57 | 57 | 2128-06-24 15:05:20 | 2128-06-27 12:32:29 | 2.8939 |
| 368 | 271 | 173727 | 249196 | carevue | MICU | SICU | 52 | 23 | 2120-08-07 23:12:42 | 2120-08-10 00:39:04 | 2.0600 |
| 369 | 272 | 164716 | 210407 | carevue | CCU | CCU | 57 | 57 | 2186-12-25 21:08:04 | 2186-12-27 12:01:13 | 1.6202 |
icustays.groupby('FIRST_CAREUNIT').median()
| SUBJECT_ID | HADM_ID | ICUSTAY_ID | FIRST_WARDID | LAST_WARDID | LOS | |
|---|---|---|---|---|---|---|
| FIRST_CAREUNIT | ||||||
| CCU | 22964.5 | 150074.5 | 249373.5 | 7.0 | 7.0 | 2.19775 |
| CSRU | 24488.0 | 150225.0 | 250492.0 | 14.0 | 14.0 | 2.15290 |
| MICU | 26489.5 | 150368.0 | 250524.0 | 50.0 | 50.0 | 2.09550 |
| NICU | 15456.5 | 149206.5 | 249308.0 | 56.0 | 56.0 | 0.80250 |
| SICU | 30084.0 | 149744.0 | 248649.0 | 33.0 | 33.0 | 2.25220 |
| TSICU | 28716.0 | 148915.0 | 250685.0 | 14.0 | 14.0 | 2.11150 |
From these statistics we can see that, as far as LOS is concerned, a substantial difference in the median exists only between NICU and the other care units; the rest have very similar medians. We can therefore reduce the units to two groups: NICU and ICU (which includes all the others).
icustays['FIRST_CAREUNIT'].replace({'CCU': 'ICU', 'CSRU': 'ICU', 'MICU': 'ICU','SICU': 'ICU', 'TSICU': 'ICU'}, inplace=True)
icustays['CATEGORY'] = icustays['FIRST_CAREUNIT']
icu_list = icustays.groupby('HADM_ID')['CATEGORY'].apply(list).reset_index()
icu_list.head()
| | HADM_ID | CATEGORY |
|---|---|---|
| 0 | 100001 | [ICU] |
| 1 | 100003 | [ICU] |
| 2 | 100006 | [ICU] |
| 3 | 100007 | [ICU] |
| 4 | 100009 | [ICU] |
icustays['FIRST_CAREUNIT'].value_counts()
ICU     53432
NICU     8100
Name: FIRST_CAREUNIT, dtype: int64
# Create admission-ICU matrix
icu_item = pd.get_dummies(icu_list['CATEGORY'].apply(pd.Series).stack()).sum(level=0)
icu_item[icu_item >= 1] = 1
icu_item = icu_item.join(icu_list['HADM_ID'], how="outer")
icu_item.head()
| | ICU | NICU | HADM_ID |
|---|---|---|---|
| 0 | 1 | 0 | 100001 |
| 1 | 1 | 0 | 100003 |
| 2 | 1 | 0 | 100006 |
| 3 | 1 | 0 | 100007 |
| 4 | 1 | 0 | 100009 |
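Note that `DataFrame.sum(level=0)`, used above to build the admission-ICU matrix, was deprecated and later removed from pandas. A minimal sketch of an equivalent construction with `explode` and `groupby`, run on a toy stand-in for `icu_list` (the three `HADM_ID` values are illustrative only):

```python
import pandas as pd

# Toy stand-in for icu_list: one row per admission, a list of care-unit categories
icu_list = pd.DataFrame({
    "HADM_ID": [100001, 100003, 100006],
    "CATEGORY": [["ICU"], ["NICU", "ICU"], ["ICU"]],
})

# explode() unnests the lists, get_dummies() one-hot encodes each category,
# and groupby(level=0).sum() collapses back to one row per admission
exploded = icu_list["CATEGORY"].explode()
icu_item = pd.get_dummies(exploded).groupby(level=0).sum()
icu_item[icu_item >= 1] = 1            # clip repeat stays to a 0/1 indicator
icu_item = icu_item.join(icu_list["HADM_ID"])
print(icu_item)
```

This produces the same 0/1 admission-ICU matrix as the `apply(pd.Series).stack()` pattern, without the removed `level` argument.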
# Merge ICU data with main dataFrame
final_df = admits_patients_diag.merge(icu_item, how='outer', on='HADM_ID')
final_df.head()
| | SUBJECT_ID | HADM_ID | ADMITTIME | DEATHTIME | ADMISSION_TYPE | INSURANCE | ETHNICITY | DIAGNOSIS | LOS | DIED_AT_HOSPITAL | ... | misc | muscular | neoplasms | nervous | pregnancy | prenatal | respiratory | skin | ICU | NICU |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 22.0 | 165315 | 2196-04-09 12:26:00 | NaT | EMERGENCY | Private | WHITE | BENZODIAZEPINE OVERDOSE | 1.144444 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 1 | 23.0 | 152223 | 2153-09-03 07:15:00 | NaT | ELECTIVE | Medicare | WHITE | CORONARY ARTERY DISEASE\CORONARY ARTERY BYPASS... | 5.496528 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 2 | 23.0 | 124321 | 2157-10-18 19:34:00 | NaT | EMERGENCY | Medicare | WHITE | BRAIN MASS | 6.768056 | 0.0 | ... | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 3 | 24.0 | 161859 | 2139-06-06 16:14:00 | NaT | EMERGENCY | Private | WHITE | INTERIOR MYOCARDIAL INFARCTION | 2.856944 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 4 | 25.0 | 129635 | 2160-11-02 02:06:00 | NaT | EMERGENCY | Private | WHITE | ACUTE CORONARY SYNDROME | 3.534028 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
5 rows × 33 columns
final_df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 58962 entries, 0 to 58961
Data columns (total 33 columns):
 #   Column            Non-Null Count  Dtype
---  ------            --------------  -----
 0   SUBJECT_ID        58878 non-null  float64
 1   HADM_ID           58962 non-null  int64
 2   ADMITTIME         58878 non-null  datetime64[ns]
 3   DEATHTIME         5774 non-null   datetime64[ns]
 4   ADMISSION_TYPE    58878 non-null  object
 5   INSURANCE         58878 non-null  object
 6   ETHNICITY         58878 non-null  object
 7   DIAGNOSIS         58853 non-null  object
 8   LOS               58878 non-null  float64
 9   DIED_AT_HOSPITAL  58878 non-null  float64
 10  GENDER            58878 non-null  float64
 11  DOB               58878 non-null  datetime64[ns]
 12  ADMIT_MIN         58878 non-null  datetime64[ns]
 13  AGE               58878 non-null  object
 14  blood             58878 non-null  float64
 15  circulatory       58878 non-null  float64
 16  congenital        58878 non-null  float64
 17  digestive         58878 non-null  float64
 18  endocrine         58878 non-null  float64
 19  genitourinary     58878 non-null  float64
 20  infectious        58878 non-null  float64
 21  injury            58878 non-null  float64
 22  mental            58878 non-null  float64
 23  misc              58878 non-null  float64
 24  muscular          58878 non-null  float64
 25  neoplasms         58878 non-null  float64
 26  nervous           58878 non-null  float64
 27  pregnancy         58878 non-null  float64
 28  prenatal          58878 non-null  float64
 29  respiratory       58878 non-null  float64
 30  skin              58878 non-null  float64
 31  ICU               57786 non-null  float64
 32  NICU              57786 non-null  float64
dtypes: datetime64[ns](4), float64(23), int64(1), object(5)
memory usage: 15.3+ MB
# Replace NaNs with 0
final_df['ICU'].fillna(value=0, inplace=True)
final_df['NICU'].fillna(value=0, inplace=True)
final_df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 58962 entries, 0 to 58961
Data columns (total 33 columns):
 #   Column            Non-Null Count  Dtype
---  ------            --------------  -----
 0   SUBJECT_ID        58878 non-null  float64
 1   HADM_ID           58962 non-null  int64
 2   ADMITTIME         58878 non-null  datetime64[ns]
 3   DEATHTIME         5774 non-null   datetime64[ns]
 4   ADMISSION_TYPE    58878 non-null  object
 5   INSURANCE         58878 non-null  object
 6   ETHNICITY         58878 non-null  object
 7   DIAGNOSIS         58853 non-null  object
 8   LOS               58878 non-null  float64
 9   DIED_AT_HOSPITAL  58878 non-null  float64
 10  GENDER            58878 non-null  float64
 11  DOB               58878 non-null  datetime64[ns]
 12  ADMIT_MIN         58878 non-null  datetime64[ns]
 13  AGE               58878 non-null  object
 14  blood             58878 non-null  float64
 15  circulatory       58878 non-null  float64
 16  congenital        58878 non-null  float64
 17  digestive         58878 non-null  float64
 18  endocrine         58878 non-null  float64
 19  genitourinary     58878 non-null  float64
 20  infectious        58878 non-null  float64
 21  injury            58878 non-null  float64
 22  mental            58878 non-null  float64
 23  misc              58878 non-null  float64
 24  muscular          58878 non-null  float64
 25  neoplasms         58878 non-null  float64
 26  nervous           58878 non-null  float64
 27  pregnancy         58878 non-null  float64
 28  prenatal          58878 non-null  float64
 29  respiratory       58878 non-null  float64
 30  skin              58878 non-null  float64
 31  ICU               58962 non-null  float64
 32  NICU              58962 non-null  float64
dtypes: datetime64[ns](4), float64(23), int64(1), object(5)
memory usage: 15.3+ MB
# Remove deceased persons as they will skew LOS result
final_df = final_df[final_df['DIED_AT_HOSPITAL'] == 0]
final_df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 53104 entries, 0 to 58877
Data columns (total 33 columns):
 #   Column            Non-Null Count  Dtype
---  ------            --------------  -----
 0   SUBJECT_ID        53104 non-null  float64
 1   HADM_ID           53104 non-null  int64
 2   ADMITTIME         53104 non-null  datetime64[ns]
 3   DEATHTIME         0 non-null      datetime64[ns]
 4   ADMISSION_TYPE    53104 non-null  object
 5   INSURANCE         53104 non-null  object
 6   ETHNICITY         53104 non-null  object
 7   DIAGNOSIS         53080 non-null  object
 8   LOS               53104 non-null  float64
 9   DIED_AT_HOSPITAL  53104 non-null  float64
 10  GENDER            53104 non-null  float64
 11  DOB               53104 non-null  datetime64[ns]
 12  ADMIT_MIN         53104 non-null  datetime64[ns]
 13  AGE               53104 non-null  object
 14  blood             53104 non-null  float64
 15  circulatory       53104 non-null  float64
 16  congenital        53104 non-null  float64
 17  digestive         53104 non-null  float64
 18  endocrine         53104 non-null  float64
 19  genitourinary     53104 non-null  float64
 20  infectious        53104 non-null  float64
 21  injury            53104 non-null  float64
 22  mental            53104 non-null  float64
 23  misc              53104 non-null  float64
 24  muscular          53104 non-null  float64
 25  neoplasms         53104 non-null  float64
 26  nervous           53104 non-null  float64
 27  pregnancy         53104 non-null  float64
 28  prenatal          53104 non-null  float64
 29  respiratory       53104 non-null  float64
 30  skin              53104 non-null  float64
 31  ICU               53104 non-null  float64
 32  NICU              53104 non-null  float64
dtypes: datetime64[ns](4), float64(23), int64(1), object(5)
memory usage: 13.8+ MB
# Remove admissions with a negative LOS, likely a data-entry error
final_df = final_df[final_df['LOS'] > 0]
final_df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 58878 entries, 0 to 58877
Data columns (total 33 columns):
 #   Column            Non-Null Count  Dtype
---  ------            --------------  -----
 0   SUBJECT_ID        58878 non-null  float64
 1   HADM_ID           58878 non-null  int64
 2   ADMITTIME         58878 non-null  datetime64[ns]
 3   DEATHTIME         5774 non-null   datetime64[ns]
 4   ADMISSION_TYPE    58878 non-null  object
 5   INSURANCE         58878 non-null  object
 6   ETHNICITY         58878 non-null  object
 7   DIAGNOSIS         58853 non-null  object
 8   LOS               58878 non-null  float64
 9   DIED_AT_HOSPITAL  58878 non-null  float64
 10  GENDER            58878 non-null  float64
 11  DOB               58878 non-null  datetime64[ns]
 12  ADMIT_MIN         58878 non-null  datetime64[ns]
 13  AGE               58878 non-null  object
 14  blood             58878 non-null  float64
 15  circulatory       58878 non-null  float64
 16  congenital        58878 non-null  float64
 17  digestive         58878 non-null  float64
 18  endocrine         58878 non-null  float64
 19  genitourinary     58878 non-null  float64
 20  infectious        58878 non-null  float64
 21  injury            58878 non-null  float64
 22  mental            58878 non-null  float64
 23  misc              58878 non-null  float64
 24  muscular          58878 non-null  float64
 25  neoplasms         58878 non-null  float64
 26  nervous           58878 non-null  float64
 27  pregnancy         58878 non-null  float64
 28  prenatal          58878 non-null  float64
 29  respiratory       58878 non-null  float64
 30  skin              58878 non-null  float64
 31  ICU               57702 non-null  float64
 32  NICU              57702 non-null  float64
dtypes: datetime64[ns](4), float64(23), int64(1), object(5)
memory usage: 15.3+ MB
# Drop unused or no longer needed columns
final_df.drop(columns=['SUBJECT_ID', 'HADM_ID', 'ADMITTIME', 'ADMIT_MIN', 'DOB',
'DIAGNOSIS', 'DIED_AT_HOSPITAL', 'DEATHTIME'], inplace=True)
final_df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 53104 entries, 0 to 58877
Data columns (total 25 columns):
 #   Column          Non-Null Count  Dtype
---  ------          --------------  -----
 0   ADMISSION_TYPE  53104 non-null  object
 1   INSURANCE       53104 non-null  object
 2   ETHNICITY       53104 non-null  object
 3   LOS             53104 non-null  float64
 4   GENDER          53104 non-null  float64
 5   AGE             53104 non-null  object
 6   blood           53104 non-null  float64
 7   circulatory     53104 non-null  float64
 8   congenital      53104 non-null  float64
 9   digestive       53104 non-null  float64
 10  endocrine       53104 non-null  float64
 11  genitourinary   53104 non-null  float64
 12  infectious      53104 non-null  float64
 13  injury          53104 non-null  float64
 14  mental          53104 non-null  float64
 15  misc            53104 non-null  float64
 16  muscular        53104 non-null  float64
 17  neoplasms       53104 non-null  float64
 18  nervous         53104 non-null  float64
 19  pregnancy       53104 non-null  float64
 20  prenatal        53104 non-null  float64
 21  respiratory     53104 non-null  float64
 22  skin            53104 non-null  float64
 23  ICU             53104 non-null  float64
 24  NICU            53104 non-null  float64
dtypes: float64(21), object(4)
memory usage: 10.5+ MB
final_df.head()
| | ADMISSION_TYPE | INSURANCE | ETHNICITY | LOS | GENDER | AGE | blood | circulatory | congenital | digestive | ... | misc | muscular | neoplasms | nervous | pregnancy | prenatal | respiratory | skin | ICU | NICU |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | EMERGENCY | Private | WHITE | 1.144444 | 1.0 | SENIOR | 0.0 | 1.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 1 | ELECTIVE | Medicare | WHITE | 5.496528 | 0.0 | SENIOR | 0.0 | 4.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 2 | EMERGENCY | Medicare | WHITE | 6.768056 | 0.0 | SENIOR | 0.0 | 2.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 3 | EMERGENCY | Private | WHITE | 2.856944 | 0.0 | MIDDLE_ADULT | 0.0 | 2.0 | 0.0 | 1.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 4 | EMERGENCY | Private | WHITE | 3.534028 | 0.0 | SENIOR | 0.0 | 3.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
5 rows × 25 columns
# Create dummy columns for categorical variables
prefix_cols = ['ADM', 'INS', 'ETH', 'AGE']
dummy_cols = ['ADMISSION_TYPE', 'INSURANCE','ETHNICITY', 'AGE']
final_df = pd.get_dummies(final_df, prefix=prefix_cols, columns=dummy_cols)
final_df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 53104 entries, 0 to 58877
Data columns (total 38 columns):
 #   Column                      Non-Null Count  Dtype
---  ------                      --------------  -----
 0   LOS                         53104 non-null  float64
 1   GENDER                      53104 non-null  float64
 2   blood                       53104 non-null  float64
 3   circulatory                 53104 non-null  float64
 4   congenital                  53104 non-null  float64
 5   digestive                   53104 non-null  float64
 6   endocrine                   53104 non-null  float64
 7   genitourinary               53104 non-null  float64
 8   infectious                  53104 non-null  float64
 9   injury                      53104 non-null  float64
 10  mental                      53104 non-null  float64
 11  misc                        53104 non-null  float64
 12  muscular                    53104 non-null  float64
 13  neoplasms                   53104 non-null  float64
 14  nervous                     53104 non-null  float64
 15  pregnancy                   53104 non-null  float64
 16  prenatal                    53104 non-null  float64
 17  respiratory                 53104 non-null  float64
 18  skin                        53104 non-null  float64
 19  ICU                         53104 non-null  float64
 20  NICU                        53104 non-null  float64
 21  ADM_ELECTIVE                53104 non-null  uint8
 22  ADM_EMERGENCY               53104 non-null  uint8
 23  ADM_NEWBORN                 53104 non-null  uint8
 24  INS_Government              53104 non-null  uint8
 25  INS_Medicaid                53104 non-null  uint8
 26  INS_Medicare                53104 non-null  uint8
 27  INS_Private                 53104 non-null  uint8
 28  INS_Self Pay                53104 non-null  uint8
 29  ETH_ASIAN                   53104 non-null  uint8
 30  ETH_BLACK/AFRICAN AMERICAN  53104 non-null  uint8
 31  ETH_HISPANIC/LATINO         53104 non-null  uint8
 32  ETH_OTHER/UNKNOWN           53104 non-null  uint8
 33  ETH_WHITE                   53104 non-null  uint8
 34  AGE_MIDDLE_ADULT            53104 non-null  uint8
 35  AGE_NEWBORN                 53104 non-null  uint8
 36  AGE_SENIOR                  53104 non-null  uint8
 37  AGE_YOUNG_ADULT             53104 non-null  uint8
dtypes: float64(21), uint8(17)
memory usage: 9.8 MB
# Check for any remaining NaNs
final_df.isnull().values.sum()
0
The final DataFrame contains 37 feature columns and 1 target column (LOS), with 53,104 entries.
What is supervised learning, and why do we use it? In supervised learning, a model is trained on examples whose target values (labels) are already known, so it learns a mapping from input features to the target. LOS prediction fits this setting naturally: every past admission in MIMIC comes with its actual LOS, which serves as the label for our regression model.
We will implement the supervised learning prediction model using the Scikit-Learn machine learning library.
To implement the prediction model, the dataset is split into training and test sets at an 80:20 ratio using the scikit-learn train_test_split function.
Why split into training and test sets? Because the model is fit only on the training set and scored on the held-out test set, the reported metrics measure how well it generalizes to admissions it has never seen, rather than how well it memorizes the training data.
Using the training set, we'll fit five different regression models from the scikit-learn library with default settings and compare their R2 scores.
# Target Variable (Length-of-Stay-LOS)
LOS = final_df['LOS'].values
# Prediction Features
features = final_df.drop(columns=['LOS'])
# Split into training set 80% and test set 20%
X_train, X_test, y_train, y_test = train_test_split(features,
LOS,
test_size = .20,
random_state = 0)
# Show the results of the split
print("Training set has {} samples.".format(X_train.shape[0]))
print("Testing set has {} samples.".format(X_test.shape[0]))
Training set has 42483 samples. Testing set has 10621 samples.
# Regression models used from scikit-learn for comparison
from sklearn.linear_model import LinearRegression, SGDRegressor
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.neighbors import KNeighborsRegressor
models = [SGDRegressor(random_state = 0),
          GradientBoostingRegressor(random_state = 0),
          LinearRegression(),
          KNeighborsRegressor(),
          RandomForestRegressor(random_state = 0)]
results = {}
for model in models:
    # Fit the regression model on the training set
    reg_model = model
    reg_model.fit(X_train, y_train)
    # Make predictions on the test set
    y_test_preds = reg_model.predict(X_test)
    # Grab the model name and store its R2 score
    name = str(model).split("(")[0]
    results[name] = r2_score(y_test, y_test_preds)
    print('{} done.'.format(name))
SGDRegressor done.
GradientBoostingRegressor done.
LinearRegression done.
KNeighborsRegressor done.
RandomForestRegressor done.
# R2 score results
fig, ax = plt.subplots()
ind = range(len(results))
ax.barh(ind, list(results.values()), align='center',
color = '#55a868', alpha=0.8)
ax.set_yticks(ind)
ax.set_yticklabels(results.keys())
ax.set_xlabel('R-squared score')
ax.tick_params(left=False, top=False, right=False)
ax.set_title('Comparison of Regression Models')
fig.savefig('images/compare_models.png', bbox_inches = 'tight')
The GradientBoostingRegressor has the best R2 score (~0.37), so we focus on refining this particular model.
# GradientBoostingRegressor will be used as the LOS prediction model
reg_model = GradientBoostingRegressor(random_state=0)
reg_model.fit(X_train, y_train)
y_test_preds = reg_model.predict(X_test)
r2_not_refined = r2_score(y_test, y_test_preds)
print("R2 score is: {:.6f}".format(r2_not_refined))
R2 score is: 0.375523
To refine the GradientBoostingRegressor model, the GridSearchCV function from scikit-learn is used to test permutations of parameters such as n_estimators, max_depth, and loss.
What is GridSearchCV, and how does it work? GridSearchCV takes an estimator and a grid of candidate parameter values, trains and cross-validates one model for every combination of values in the grid, and exposes the combination with the best cross-validated score as best_estimator_.
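To see exactly what GridSearchCV enumerates, here is a minimal standard-library sketch of the candidate combinations produced by the grid used below (the helper itself is ours, not part of scikit-learn):

```python
from itertools import product

# The same grid passed to GridSearchCV below; GridSearchCV fits and
# cross-validates one model per combination and keeps the best scorer
param_grid = {'n_estimators': [200, 300],
              'max_depth': [3, 4],
              'loss': ['ls', 'lad']}

# Cartesian product of the value lists: one dict per candidate model
keys = sorted(param_grid)
combos = [dict(zip(keys, values))
          for values in product(*(param_grid[k] for k in keys))]
print(len(combos))  # 2 * 2 * 2 = 8 candidate parameter settings
```

With the default 5-fold cross-validation, those 8 candidates mean 40 model fits before the final refit of the best estimator.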
# Split into train 80% and test 20%
X_train, X_test, y_train, y_test = train_test_split(features,
LOS,
test_size = .20,
random_state = 42)
# Set the parameters by cross-validation
#tuned_parameters = [{'n_estimators': [100, 200, 300],
# 'max_depth' : [2, 3, 4],
# 'loss': ['ls', 'lad', 'huber']}]
tuned_parameters = [{'n_estimators': [200, 300],
'max_depth' : [3, 4],
'loss': ['ls', 'lad']}]
# Create and fit a gradient boosting model for each parameter combination
reg_model = GradientBoostingRegressor()
grid = GridSearchCV(reg_model, tuned_parameters)
grid.fit(X_train, y_train)
reg_model_optimized = grid.best_estimator_
# summarize the results of the grid search
print(grid.best_score_)
print(grid.best_estimator_)
0.4069102774658885
GradientBoostingRegressor(max_depth=4, n_estimators=200)
The best estimator result from GridSearchCV was n_estimators=200, max_depth=4, and loss=ls.
y_test_preds = reg_model_optimized.predict(X_test)
r2_optimized = r2_score(y_test, y_test_preds)
print("Optimized R2 score is: {:.6f}".format(r2_optimized))
Optimized R2 score is: 0.390046
print('Parameter tuning improved R2 score by {:.4f}'.format(r2_optimized-r2_not_refined))
Parameter tuning improved R2 score by 0.0145
## 8. Model evaluation and results discussion
First of all, we can look at which features were most important in predicting hospital length-of-stay with the gradient boosting regression model.
feature_imp = pd.DataFrame(reg_model_optimized.feature_importances_,
index = X_train.columns,
columns=['importance']).sort_values('importance', ascending=False)
feature_imp.head(20)
| | importance |
|---|---|
| prenatal | 0.419917 |
| respiratory | 0.126279 |
| nervous | 0.092340 |
| infectious | 0.090034 |
| digestive | 0.066756 |
| injury | 0.038093 |
| skin | 0.024466 |
| genitourinary | 0.016657 |
| congenital | 0.016028 |
| endocrine | 0.014203 |
| circulatory | 0.013947 |
| misc | 0.009343 |
| ICU | 0.007887 |
| mental | 0.007178 |
| blood | 0.005835 |
| neoplasms | 0.004422 |
| INS_Medicare | 0.004193 |
| AGE_SENIOR | 0.004126 |
| muscular | 0.004109 |
| ADM_EMERGENCY | 0.003803 |
# Plot the top-10 feature importances
fig, ax = plt.subplots(figsize=(7, 5))
ind = range(0,10)
ax.barh(ind, feature_imp['importance'].values[0:10],
align='center', color='#c44e52', alpha=0.9)
ax.set_yticks(ind)
ax.set_yticklabels(feature_imp.index[0:10].tolist())
ax.tick_params(left=False, top=False, right=False)
ax.set_title("Top 10 features for predicting LOS")
ax.set_xlabel('Feature Importance Coefficient \n(GradientBoostingRegressor)')
plt.gca().invert_yaxis()
fig.savefig('images/feature_importance.png', bbox_inches = 'tight')
Diagnoses related to prenatal issues have the highest feature importance coefficient, followed by respiratory and nervous. One clear result is that the ICD-9 diagnosis categories are by far the most important features.
In the earlier metrics section, we said that the RMSE would be used to compare the prediction model against the industry-standard average and median LOS metrics. The gradient boosting model's RMSE is better by more than 24% (percent difference) than the constant average and median models, as the chart below shows.
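RMSE can also be computed directly with NumPy. A minimal helper (the function name is ours, not from the notebook) with a small worked example:

```python
import numpy as np

def rmse(predicted, actual):
    """Root-mean-square error: square root of the mean squared residual."""
    predicted = np.asarray(predicted, dtype=float)
    actual = np.asarray(actual, dtype=float)
    return np.sqrt(np.mean((predicted - actual) ** 2))

# Worked example: residuals (1, -1, 2) -> squares (1, 1, 4) -> mean 2 -> sqrt(2)
print(rmse([2.0, 3.0, 6.0], [1.0, 4.0, 4.0]))
```

A perfect model gives an RMSE of 0, and a lower value means the predictions sit closer to the observed LOS.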
# Compare the prediction model with the constant median and average models
ml_days, md_days, avg_days = 0, 0, 0
ml_days_rms, md_days_rms, avg_days_rms = 0, 0, 0
n = y_test_preds.shape[0]
for i in range(n):
    # Accumulate the absolute error in days for each model
    ml_days += abs(y_test_preds[i] - y_test[i])
    md_days += abs(actual_median_los - y_test[i])
    avg_days += abs(actual_mean_los - y_test[i])
    # Accumulate the squared error for the RMSE
    ml_days_rms += (y_test_preds[i] - y_test[i])**2
    md_days_rms += (actual_median_los - y_test[i])**2
    avg_days_rms += (actual_mean_los - y_test[i])**2
print("Prediction Model days {:.4f}".format(ml_days/n))
print("Median Model days {:.4f}".format(md_days/n))
print("Average Model days {:.4f}".format(avg_days/n))
# RMSE is the square root of the *mean* squared error
print("Prediction Model RMSE {:.4f}".format((ml_days_rms/n)**0.5))
print("Median Model RMSE {:.4f}".format((md_days_rms/n)**0.5))
print("Average Model RMSE {:.4f}".format((avg_days_rms/n)**0.5))
Prediction Model days 5.4471
Median Model days 6.3874
Average Model days 7.2238
Prediction Model RMSE 9.8425
Median Model RMSE 13.0857
Average Model RMSE 12.6026
# RMSE plot
data = pd.DataFrame({'RMSE': [(ml_days_rms/y_test_preds.shape[0])**0.5,
                              (avg_days_rms/y_test_preds.shape[0])**0.5,
                              (md_days_rms/y_test_preds.shape[0])**0.5],
                     'LOS Model Type': ['Gradient Boosting', 'Average', 'Median']})
fig, ax = plt.subplots()
ax = sns.barplot(x='RMSE', y='LOS Model Type', data=data)
ax.set_title('RMSE comparison of Length-of-Stay models')
ax.tick_params(top=False, left=False, right=False)
fig.savefig('images/rms_comparison.png', bbox_inches = 'tight')
Another way to look at the model is to plot the proportion of accurate predictions in the test set against an allowed margin of error; other studies qualify a LOS prediction as correct if it falls within such a margin. As the allowed margin grows, the proportion of accurate predictions naturally rises for every model, but the gradient boosting model outperforms the constant models across the entire range up to 50%.
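The proportion of predictions within a given percent-error margin can be expressed in vectorized form. A minimal sketch (the helper name is ours) with a toy example:

```python
import numpy as np

def accuracy_within_margin(predicted, actual, margin):
    """Proportion of predictions whose absolute percent error is below `margin`."""
    predicted = np.asarray(predicted, dtype=float)
    actual = np.asarray(actual, dtype=float)
    pct_error = np.abs((predicted - actual) / actual)
    return np.mean(pct_error < margin)

# Toy example: percent errors of 5%, 25% and 50% against a 30% allowance,
# so 2 of 3 predictions count as accurate
print(accuracy_within_margin([1.05, 1.25, 1.5], [1.0, 1.0, 1.0], 0.30))
```

Sweeping `margin` from 0.0 to 0.5 reproduces the x-axis of the plot below for each of the three models.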
# Calculate Proportion of 'accurate' prediction as a function of allowed margin of error
reg_array = []
median_array = []
average_array = []
for i in range(6):
    reg_count, median_count, average_count = 0, 0, 0
    for j in range(y_test_preds.shape[0]):
        # Percent difference between predicted and actual LOS for each model
        reg_err = (y_test_preds[j] - y_test[j])/y_test[j]
        median_err = (actual_median_los - y_test[j])/y_test[j]
        average_err = (actual_mean_los - y_test[j])/y_test[j]
        if abs(reg_err) < i/10:
            reg_count += 1
        if abs(median_err) < i/10:
            median_count += 1
        if abs(average_err) < i/10:
            average_count += 1
    reg_array.append(reg_count/y_test_preds.shape[0])
    median_array.append(median_count/y_test_preds.shape[0])
    average_array.append(average_count/y_test_preds.shape[0])
# Plot proportion of 'accurate' prediction as a function of allowed margin of error
fig, ax = plt.subplots()
ax.plot(reg_array, label='Gradient Boosting')
ax.plot(median_array, label='Median LOS model')
ax.plot(average_array, label='Average LOS model')
ax.set_title('Proportion of Accurate Predictions vs. Percent Error')
ax.set_xlabel('Allowed Margin of Error (Percent Error)')
ax.set_ylabel('Proportion of Accurate Predictions')
ax.set_xticks(range(6))
ax.set_xticklabels(['0%', '10%', '20%', '30%', '40%', '50%'])
ax.legend(loc='lower right')
ax.tick_params(top=False, right=False)
# Save under a new name so the RMSE comparison chart is not overwritten
fig.savefig('images/accuracy_vs_error.png', bbox_inches = 'tight')
Hospital stays are a major cost for any health system: U.S. hospitals, for example, spend $377.5 billion per year, and recent Medicare legislation standardizes payments for procedures performed regardless of the number of days a patient spends in the hospital.
This incentivizes hospitals to identify patients with high LOS risk at the time of admission. Once identified, such patients can have their treatment plan optimized to minimize LOS and to lower the risk of hospital-acquired conditions. Prior knowledge of LOS can also aid logistics such as room and bed allocation planning.